Research Purpose: The goal of this project is to evaluate whether female financial analysts' earnings announcement forecast behavior differs from that of male analysts. I will then build a KNN model that uses firm earnings announcement characteristics to predict the 10-day post-announcement buy-and-hold return.
Motivation: It is widely documented in the literature that financial analysts and their forecasts are an important source of information for investors. Analysts both provide forward-looking information and analyze information already released to the market, thereby bridging the information gap between public firms and investors, leveling the playing field among investors, reducing overall information asymmetry, and enhancing market efficiency. This project focuses on female financial analysts during the COVID-19 period and investigates the impact of the pandemic on female forecast behavior relative to male forecast behavior. For the prediction part, since financial analysts provide valuable information for investors, it should be possible to predict the buy-and-hold return 10 days after an earnings announcement given the announcing firm's fundamental information and the analyst forecast information.
I will start with a discussion and summary statistics of the analyst forecast data and the COVID-19 data. I will then present several graphs that motivate the two research ideas. In the last part of this project, I will present and discuss the main results for both.
The Institutional Brokers' Estimate System, or IBES, is a database of analyst estimates and company guidance for more than 23,000 public companies. The database aggregates the available financial data on companies and company sectors to aid in decision-making. It features a host of data, from equity analyst consensus to forward guidance. Historical data is available from 1976, when IBES was introduced, with international data going back to 1987.
I build the sample by first obtaining analyst forecast data from the Thomson Reuters Institutional Brokers Estimate System (I/B/E/S), starting from March 2019, one year before the WHO officially declared the COVID-19 pandemic, through November 2021, keeping roughly even sample periods before and after the pandemic shock. Specifically, I use the I/B/E/S detail history (unadjusted) files for quarterly analyst forecasts of earnings per share (EPS) for US companies. In particular, I only keep analyst forecasts whose forecast period indicator (FPI) in I/B/E/S is either 6 or 7, i.e., forecasts for the current and the next fiscal quarter, to ensure the timeliness of the forecast-related measures. Furthermore, I adjust all estimate and earnings announcement dates to the closest preceding trading date in CRSP to match the corresponding adjustment factors. The estimates are then adjusted by CRSP adjustment factors to put them on the same per-share basis as the company-reported EPS.
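The filtering and adjustment steps above can be sketched as follows on toy data. The column names FPI and VALUE follow I/B/E/S conventions and CFACSHR follows CRSP, but this is only an illustration; the exact direction of the split adjustment depends on the convention of the file vintages used.

```python
import pandas as pd

# Toy detail-history records; FPI "6"/"7" = current / next fiscal quarter.
det = pd.DataFrame({
    "TICKER":  ["AAA", "AAA", "BBB", "BBB"],
    "FPI":     ["6", "7", "1", "6"],
    "VALUE":   [1.20, 1.30, 5.00, 0.80],  # unadjusted EPS forecasts
    "CFACSHR": [2.0, 2.0, 1.0, 1.0],      # CRSP cumulative share adjustment factor
})

# Keep only current- and next-quarter forecasts.
det = det[det["FPI"].isin(["6", "7"])].copy()

# Put estimates on the same per-share basis as the reported EPS.
det["new_value"] = det["VALUE"] / det["CFACSHR"]
```

The annual FPI codes (1, 2, ...) are dropped here, mirroring the FPI-in-(6, 7) screen described above.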
The COVID-19 case data come from the New York Times COVID-19 database. The New York Times released a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. This compiled panel is assembled from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak. Since the first reported coronavirus case in Washington State on Jan. 21, 2020, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
The data has been used to power New York Times maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists, and government officials who would like access to the data to better understand the outbreak.
To carry out the analyst forecast analysis, I need to prepare the forecast dataset and the COVID-19 dataset and then merge them. This section first goes through the forecast dataset and checks each variable for outliers and heavy missingness. Because the two datasets are merged on state and Year_Month, and the state in the analyst forecast data refers to the analyst's location, I also need to check whether analysts cluster in certain states. In the second step, I check the COVID-19 data with two graphs to see whether it follows our intuition about how COVID-19 spread across the US.
import pandas as pd
import numpy as np
# Load the seven analyst-forecast files and stack them into one DataFrame.
data = pd.concat([pd.read_csv(f"ana{i}.csv") for i in range(1, 8)],
                 ignore_index=True)
# View All Columns in dataset.
print(data.columns)
Index(['Unnamed: 0', 'AMASKCD', 'ANALYST', 'ESTIMID', 'TICKER', 'ESTIMATOR',
'ANALYS', 'VALUE', 'FPEDATS', 'REVDATS', 'REVTIMS', 'ANNDATS',
'ANNTIMS', 'permno', 'basis', 'repdats', 'act', 'new_value',
'Top_Broker', 'fqtr', 'EXPERIENCE', 'EXPWITHFIRM', 'size', 'Leverage',
'ROA', 'Cash_holding', 'RD', 'Total_asset', 'BM', 'INSTOWN', 'accrual',
'EARNGROWTH', 'ANNDATS2', 'REVDATS2', 'repdats2', 'Year_Quarter',
'Year_Month', 'SIC4', 'gender', 'city', 'state_code', 'Zipcode'],
dtype='object')
As the output above shows, there are 41 variables in the analyst dataset (plus an unnamed index column). In this section, I exhibit the summary statistics and define the main variables.
AMASKCD: the analyst mask ID in the IBES dataset.
ESTIMID: the analyst's brokerage in the IBES dataset.
TICKER: the ticker of the firm the analyst forecasts.
act: the actual earnings value released on the firm's earnings announcement day.
new_value: the analyst's earnings forecast issued before the firm's earnings announcement day.
Top_Broker: a dummy variable indicating whether the analyst works at a well-known US broker.
fqtr: firm fiscal quarter.
EXPERIENCE: current forecast issuing year minus the year of the analyst's first forecast.
EXPWITHFIRM: current forecast issuing year minus the year of the analyst's first forecast for the specific firm.
size: firm size, calculated as log(stock price × shares outstanding).
Leverage: firm leverage ratio.
ROA: firm return on assets.
Cash_holding: firm cash holdings.
RD: firm research and development expenditure.
Total_asset: firm total assets, including long-term and short-term assets.
BM: firm book-to-market ratio.
INSTOWN: firm institutional ownership ratio.
accrual: firm accruals.
EARNGROWTH: firm earnings growth, calculated as (current earnings − earnings in the same fiscal quarter last year) / earnings in the same fiscal quarter last year.
ANNDATS2: analyst forecast release date.
REVDATS2: analyst forecast revision date.
repdats2: earnings announcement date.
Year_Quarter: year-quarter of the analyst forecast.
Year_Month: year-month of the analyst forecast.
SIC4: firm industry code (four-digit SIC).
gender: analyst gender (1 = female).
city: analyst's city.
state_code: analyst's state.
Zipcode: analyst's zipcode.
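To make two of the constructed fundamentals concrete, the sketch below computes size and EARNGROWTH on a toy quarterly panel. The Compustat-style column names (prccq for price, cshoq for shares, epspxq for reported EPS) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical quarterly fundamentals for one firm over five quarters.
fund = pd.DataFrame({
    "permno":   [1001] * 5,
    "fqtr_seq": [1, 2, 3, 4, 5],
    "prccq":    [20.0, 22.0, 25.0, 24.0, 26.0],   # quarter-end price
    "cshoq":    [100.0] * 5,                      # shares outstanding
    "epspxq":   [0.50, 0.55, 0.60, 0.52, 0.60],   # reported EPS
})

# size = log(price * shares outstanding), as defined above.
fund["size"] = np.log(fund["prccq"] * fund["cshoq"])

# EARNGROWTH = (current EPS - EPS four quarters ago) / EPS four quarters ago,
# computed within firm via a 4-period percentage change.
fund["EARNGROWTH"] = fund.groupby("permno")["epspxq"].pct_change(periods=4)
```

The first four quarters have no year-ago comparison and are left as NaN, matching the smaller count of EARNGROWTH in the summary table.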
# Generate Summary Statistics.
data[['act', 'new_value', 'Top_Broker', 'fqtr',
'EXPERIENCE', 'EXPWITHFIRM', 'size', 'Leverage', 'ROA', 'Cash_holding',
'RD', 'Total_asset', 'BM', 'INSTOWN', 'accrual', 'EARNGROWTH',
'ANNDATS2', 'REVDATS2', 'repdats2', 'Year_Quarter', 'Year_Month',
'SIC4', 'gender']].describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| act | 725112.0 | 8.348247e-01 | 2.579519 | -2.174900e+02 | 3.000000e-02 | 5.300000e-01 | 1.270000e+00 | 1.577800e+02 |
| new_value | 725112.0 | 7.157306e-01 | 1.186827 | -2.170000e+00 | 4.000000e-02 | 4.800000e-01 | 1.150000e+00 | 5.030000e+00 |
| Top_Broker | 725112.0 | 2.577409e-01 | 0.437391 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 |
| fqtr | 725112.0 | 2.497406e+00 | 1.109664 | 1.000000e+00 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 8.000000e+00 |
| EXPERIENCE | 725112.0 | 1.372030e+01 | 8.865287 | 0.000000e+00 | 7.000000e+00 | 1.200000e+01 | 1.900000e+01 | 4.000000e+01 |
| EXPWITHFIRM | 725112.0 | 5.302115e+00 | 5.332798 | -1.000000e+00 | 1.000000e+00 | 4.000000e+00 | 8.000000e+00 | 3.700000e+01 |
| size | 713746.0 | 8.532651e+00 | 1.943188 | -3.147107e-01 | 7.262915e+00 | 8.536149e+00 | 9.884393e+00 | 1.465897e+01 |
| Leverage | 663543.0 | 3.850836e-01 | 40.032064 | -2.951367e+03 | 2.713711e-01 | 7.262438e-01 | 1.491229e+00 | 9.731578e+02 |
| ROA | 714079.0 | -4.052547e-04 | 0.023202 | -1.105349e+01 | -4.650750e-05 | 2.534570e-05 | 1.360850e-04 | 7.376654e-02 |
| Cash_holding | 656937.0 | 1.432306e-01 | 0.177776 | 0.000000e+00 | 2.925610e-02 | 8.352428e-02 | 1.775876e-01 | 9.995069e-01 |
| RD | 714463.0 | 1.264210e-02 | 0.037436 | -2.631101e-01 | 0.000000e+00 | 0.000000e+00 | 1.174265e-02 | 4.905942e+00 |
| Total_asset | 714322.0 | 4.412519e+04 | 203511.230850 | 1.960000e-01 | 1.597763e+03 | 5.906564e+03 | 2.160970e+04 | 3.757576e+06 |
| BM | 713624.0 | 5.893034e-01 | 1.593991 | -2.184231e+02 | 1.532555e-01 | 3.652840e-01 | 7.641880e-01 | 7.051203e+01 |
| INSTOWN | 714374.0 | 7.524225e-01 | 0.245178 | 2.827255e-07 | 6.469856e-01 | 8.122404e-01 | 9.186487e-01 | 1.530762e+01 |
| accrual | 711789.0 | -9.209817e-02 | 1.117691 | -1.979402e+02 | -7.639224e-02 | -2.578883e-02 | -1.885338e-03 | 1.734346e+01 |
| EARNGROWTH | 686999.0 | 2.417483e-05 | 0.027642 | -1.102792e+01 | -5.415531e-05 | 1.124101e-06 | 6.143918e-05 | 3.312084e+00 |
| repdats2 | 725112.0 | 2.020353e+07 | 8950.598631 | 2.019011e+07 | 2.020021e+07 | 2.020102e+07 | 2.021061e+07 | 2.022062e+07 |
| Year_Quarter | 725112.0 | 2.020025e+05 | 78.872093 | 2.019010e+05 | 2.019040e+05 | 2.020020e+05 | 2.021010e+05 | 2.021040e+05 |
| Year_Month | 725112.0 | 2.020062e+05 | 78.912889 | 2.019010e+05 | 2.019100e+05 | 2.020050e+05 | 2.021020e+05 | 2.021120e+05 |
| SIC4 | 719962.0 | 5.710675e+03 | 2680.208923 | 1.700000e+02 | 3.639000e+03 | 5.812000e+03 | 7.372000e+03 | 9.999000e+03 |
| gender | 725112.0 | 9.597276e-02 | 0.294554 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 |
As the table shows, the act variable ranges from -214 to 157, which is extreme before adjusting for stock splits and dividends. The new_value variable, the analyst earnings-per-share forecast, ranges from -2.17 to 5.03, consistent with the usual intuition for an EPS range. All other variables fall within intuitively reasonable ranges given their definitions.
import plotly.express as px
d = data[['AMASKCD', 'state_code']].drop_duplicates()
d = d.groupby('state_code').count().reset_index()
fig2 = px.choropleth(d,
locations='state_code',
locationmode="USA-states",
scope="usa",
color='AMASKCD',
color_continuous_scale='Viridis_r')
fig2.update_layout(
title_text = 'Analyst Location Distribution',
title_font_family="Times New Roman",
title_font_size = 22,
title_font_color="black",
title_x=0.45)
As the figure above indicates, most analysts are located in New York State, because New York City hosts many financial institutions. This poses a problem for our first research topic, since the results may be driven by New York-based analysts. To address this, I will introduce a test model after the main results model to check whether New York is the main driving force behind the results.
In this section, I first present the summary statistics of the COVID-19 data, then introduce two figures on the COVID-19 outbreak in the US to check whether the data align with our intuition. The first map shows COVID-19 cases on 2020-03-01, near the beginning of the US outbreak; the second shows cases on 2021-11-25.
covid = pd.read_excel("covid_cases.xlsx")
covid[['fips', 'cases', 'deaths', 'new_cases', 'avg_new_cases', 'avg_new_cases_r']].describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| fips | 39816.0 | 32.535714 | 18.905041 | 1.0 | 17.750000 | 31.500000 | 46.250000 | 78.0 |
| cases | 39816.0 | 384163.296112 | 670592.798311 | 0.0 | 12456.500000 | 124343.000000 | 476375.500000 | 5892644.0 |
| deaths | 39816.0 | 6916.022001 | 11643.375990 | 0.0 | 264.000000 | 2186.500000 | 8282.000000 | 77042.0 |
| new_cases | 39816.0 | 1469.386352 | 3947.989144 | -40527.0 | 30.000000 | 378.000000 | 1360.250000 | 193786.0 |
| avg_new_cases | 39816.0 | 1414.094922 | 3031.222743 | -4502.0 | 95.285714 | 481.714286 | 1457.642857 | 69956.0 |
| avg_new_cases_r | 39816.0 | 0.018572 | 0.136090 | -1.0 | -0.024179 | 0.000000 | 0.039257 | 1.0 |
fips: US state FIPS code.
cases: cumulative COVID-19 case count.
deaths: cumulative COVID-19 death count.
new_cases: daily new COVID-19 cases.
avg_new_cases: 7-day trailing average of new COVID-19 cases.
avg_new_cases_r: growth rate of the 7-day average of new cases.
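The two smoothed series can be reproduced from a cumulative case series as sketched below. The toy numbers are made up, and the exact definitions in the compiled file may differ slightly.

```python
import pandas as pd

# Toy daily cumulative case counts for one state.
s = pd.DataFrame({
    "date":  pd.date_range("2020-03-01", periods=10, freq="D"),
    "cases": [1, 3, 6, 10, 15, 21, 28, 36, 45, 55],
})

# Daily new cases from the cumulative series (first day = first cumulative count).
s["new_cases"] = s["cases"].diff().fillna(s["cases"])

# Trailing 7-day average of new cases, as in avg_new_cases.
s["avg_new_cases"] = s["new_cases"].rolling(7, min_periods=1).mean()

# Day-over-day growth rate of the smoothed series, one reading of avg_new_cases_r.
s["avg_new_cases_r"] = s["avg_new_cases"].pct_change()
```

Because averaging smooths daily spikes, avg_new_cases is less volatile than new_cases, which is exactly the pattern in the standard deviations above.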
The statistics table above shows each variable's value range. Cases and deaths span a huge range because the dataset records cumulative counts from the beginning of the outbreak. Because the 7-day average smooths the daily new-case counts, the standard deviation of avg_new_cases is smaller than that of new_cases.
fig1 = px.choropleth(covid[covid['date'] == '2020-03-01'],
locations='state code',
locationmode="USA-states",
scope="usa",
color='cases',
color_continuous_scale="Viridis_r")
fig1.update_layout(
title_text = 'Covid 19 Cases on 2020-03-01',
title_font_family="Times New Roman",
title_font_size = 22,
title_font_color="black",
title_x=0.45)
As the figure above indicates, case counts were still very small on 2020-03-01, at the start of the US outbreak.
fig2 = px.choropleth(covid[covid['date'] == '2021-11-25'],
locations='state code',
locationmode="USA-states",
scope="usa",
color='cases',
color_continuous_scale='Viridis_r')
fig2.update_layout(
title_text = 'Covid 19 Cases on 2021-11-25',
title_font_family="Times New Roman",
title_font_size = 22,
title_font_color="black",
title_x=0.45)
As the figure above indicates, by 2021-11-25 states had accumulated case counts running into the millions. Both maps match our intuition about COVID-19 spreading across the US: as time passes, cumulative cases rise in every state.
In this section, I present the code that merges the analyst forecast data and the COVID-19 dataset on date and state_code.
covid['date'] = pd.to_datetime(covid['date'])
covid = covid.rename(columns = {"date":"ANNDATS2",
"state code":"state_code"})
data['ANNDATS2'] = pd.to_datetime(data['ANNDATS2'])
data = pd.merge(data, covid[['ANNDATS2', 'state_code', 'cases', 'deaths',
'new_cases', 'avg_new_cases', 'avg_new_cases_r']], how = "left", on = ["ANNDATS2", "state_code"])
data[['cases', 'deaths', 'new_cases', 'avg_new_cases', 'avg_new_cases_r']] = data[['cases', 'deaths',
'new_cases', 'avg_new_cases', 'avg_new_cases_r']].fillna(0)
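Before trusting the merged panel, it helps to check how many forecasts actually find a COVID-19 record; pre-pandemic announcement dates should fail to match, and those are exactly the rows the fillna(0) step covers. A toy check using the merge indicator:

```python
import pandas as pd

# Toy versions of the two tables keyed on (ANNDATS2, state_code).
fc = pd.DataFrame({
    "ANNDATS2":   pd.to_datetime(["2020-04-01", "2020-04-01", "2019-06-01"]),
    "state_code": ["NY", "CA", "NY"],
})
cv = pd.DataFrame({
    "ANNDATS2":   pd.to_datetime(["2020-04-01"]),
    "state_code": ["NY"],
    "cases":      [80000],
})

# indicator=True adds a _merge column showing which rows found a COVID match.
m = fc.merge(cv, how="left", on=["ANNDATS2", "state_code"], indicator=True)
match_rate = (m["_merge"] == "both").mean()
```

Unmatched rows carry NaN in the COVID columns, which is why filling them with 0 (no recorded cases) is a reasonable convention for the pre-pandemic period.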
# Dummy return.
data['rnewcase_pos_d'] = np.int64(data['avg_new_cases_r'] > 0)
data['rnewcase_neg_d'] = np.int64(data['avg_new_cases_r'] < 0)
In this section, I calculate several variables that proxy for financial analysts' behavior based on their earnings-per-share forecasts ahead of earnings announcements. These variables include Pessimism, COVERAGESIZE, Herding, Updating Frequency, Bold_pos, Bold_neg, Bold_d, Rounding, Reissue, Forecast Age, and Rec_Chg. Each variable is defined below:
# 180 Day Consensus and Pessimism.
data = data.sort_values(["repdats2", 'permno', 'ANNDATS2', 'ANNTIMS']).reset_index(drop=True)
ll2 = data.groupby(["repdats2", 'permno']).rolling('180D', min_periods=1, on = "ANNDATS2")['new_value'].mean().reset_index()
ll2 = ll2.rename(columns={"new_value": "acc_avg180"})
error2 = data.groupby(["repdats2", 'permno']).rolling('180D', min_periods=1, on = "ANNDATS2")['new_value'].std().reset_index()
error2 = error2.rename(columns={"new_value": "acc_std180"})
data['shift_new_value'] = data.groupby(['AMASKCD', 'permno', 'repdats2'])['new_value'].shift()
data['acc_avg180'] = ll2['acc_avg180']
data['acc_std180'] = error2['acc_std180']
data['Pessimism_d180'] = np.int64(data['new_value'] < data['acc_avg180'])
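The mechanics of Pessimism_d180 can be verified on a toy firm-quarter: the 180-day rolling mean acts as the running consensus (note that, as coded, it includes the analyst's own current forecast), and the dummy flags forecasts that fall below it.

```python
import numpy as np
import pandas as pd

# Three forecasts for one firm-quarter, indexed by announcement date.
toy = pd.DataFrame({"new_value": [1.00, 0.90, 1.20]},
                   index=pd.date_range("2020-01-01", periods=3, freq="D"))

# Running 180-day consensus (includes the current observation).
toy["acc_avg180"] = toy["new_value"].rolling("180D", min_periods=1).mean()

# Pessimism dummy: forecast strictly below the running consensus.
toy["Pessimism"] = (toy["new_value"] < toy["acc_avg180"]).astype(np.int64)
```

Here only the second forecast (0.90 against a consensus of 0.95) is flagged as pessimistic; the first forecast equals its own consensus, so the strict inequality leaves it at 0.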
# Firm Covered.
temp = data[['AMASKCD', 'permno', 'Year_Month']].sort_values(by = ['AMASKCD', 'Year_Month', 'permno'])
temp = temp.drop_duplicates()
num_firm = temp.groupby(['AMASKCD', 'Year_Month']).count().reset_index()
num_firm = num_firm.rename(columns={'permno': 'Firm Covered'})
# COVERAGESIZE
lag_num_firm = num_firm.groupby(["AMASKCD"]).shift()
lag_num_firm = lag_num_firm.rename(columns = {"Firm Covered":"L_Firm_Covered"})
num_firm['COVERAGESIZE'] = lag_num_firm["L_Firm_Covered"]
data = pd.merge(data, num_firm, on = ['AMASKCD', 'Year_Month'], how = 'left')
# Forecast_Number.
temp = data[['AMASKCD', 'permno', "repdats", 'Year_Month']].sort_values(by = ['AMASKCD', 'Year_Month', 'permno', "repdats"])
temp = temp.drop_duplicates()
num_ea = temp.groupby(['AMASKCD', 'Year_Month']).count().reset_index()
num_ea = num_ea.rename(columns={'permno': 'Forecast_Number'})
data = pd.merge(data, num_ea[['AMASKCD', 'Forecast_Number', 'Year_Month']], on = ['AMASKCD', 'Year_Month'], how = 'left')
# Updating Frequency.
temp2 = data[['AMASKCD', 'new_value', 'Year_Month']].sort_values(by = ['AMASKCD', 'Year_Month', 'new_value'])
up_freq = temp2.groupby(['AMASKCD', 'Year_Month']).count().reset_index()
up_freq = up_freq.rename(columns={'new_value': 'Updating Frequency'})
data = pd.merge(data, up_freq, on = ['AMASKCD', 'Year_Month'], how = 'left')
# Distinct SIC code number in pre-period.
temp = data[['AMASKCD', 'SIC4', 'Year_Month']].sort_values(by = ['AMASKCD', 'Year_Month', 'SIC4'])
temp = temp.drop_duplicates()
num_SIC = temp.groupby(['AMASKCD', 'Year_Month']).count().reset_index()
num_SIC = num_SIC.rename(columns={'SIC4': 'SIC Covered'})
lag_num_SIC = num_SIC.groupby(["AMASKCD"]).shift()
lag_num_SIC = lag_num_SIC.rename(columns = {"SIC Covered":"L_SIC_Covered"})
num_SIC['COVERAGEFOCUS'] = lag_num_SIC["L_SIC_Covered"]
data = pd.merge(data, num_SIC, on = ['AMASKCD', 'Year_Month'], how = 'left')
# Distinct forecast number in pre-period.
temp = data[['AMASKCD', 'new_value', 'Year_Month']].sort_values(by = ['AMASKCD', 'Year_Month'])
num_fore = temp.groupby(['AMASKCD', 'Year_Month']).count().reset_index()
num_fore = num_fore.rename(columns={'new_value': 'Forecast Issued'})
lag_num_fore = num_fore.groupby(["AMASKCD"]).shift()
lag_num_fore = lag_num_fore.rename(columns = {"Forecast Issued":"L_Forecast_Issued"})
num_fore['FORECASTFREQ_LAG'] = lag_num_fore["L_Forecast_Issued"]
data = pd.merge(data, num_fore, on = ['AMASKCD', 'Year_Month'], how = 'left')
# Herding.
data = data.sort_values(by = ['AMASKCD', 'permno', 'repdats2', 'ANNDATS2', 'ANNTIMS']).reset_index(drop=True)
kk = data[['AMASKCD', 'permno','repdats2', 'ANNDATS2','ANNTIMS','new_value', 'acc_avg180']].copy()  # .copy() avoids SettingWithCopyWarning
kk['shift'] = kk.groupby(['AMASKCD', 'permno', 'repdats2'])['new_value'].shift()
herding = np.int64((kk['new_value'] > kk['acc_avg180']) & (kk['new_value'] < kk['shift'])) + np.int64((kk['new_value'] < kk['acc_avg180']) & (kk['new_value'] > kk['shift']))
data['Herding'] = herding
# Bold_pos.
data["bold_pos"] = np.int64((data['new_value'] > data['acc_avg180']) & (data['new_value'] > data['shift_new_value']))
# Bold_neg.
data["bold_neg"] = np.int64((data['new_value'] < data['acc_avg180']) & (data['new_value'] < data['shift_new_value']))
# Bold_d.
data["bold_d"] = data["bold_neg"] + data["bold_pos"]
# Rounding: last cent digit of the forecast is 0 or 5 (round first to avoid float noise).
data['Rounding'] = np.int64(np.round(data['new_value']*100) % 5 == 0)
# Reissue.
data['Reissue'] = np.int64(data['REVDATS2'] != data['ANNDATS2'])
# Forecast Age.
data['days'] = pd.DataFrame(pd.to_datetime(data['repdats']) - pd.to_datetime(data['ANNDATS2']))[0].dt.days
data['Forecast_Age'] = np.log(data['days']+1)
data = data[data.Forecast_Age != -np.inf]
data = data.replace(-np.inf, np.nan)
data = data.replace(np.inf, np.nan)
# Rec_Chg
data = data.sort_values(by = ['AMASKCD', 'permno', 'repdats2', 'ANNDATS2', 'ANNTIMS']).reset_index(drop=True)
kk = data[['AMASKCD', 'permno','repdats2', 'ANNDATS2','ANNTIMS','new_value', 'acc_avg180']].copy()
kk = kk.groupby(['AMASKCD', 'permno', 'repdats2'])[["ANNDATS2", 'new_value']].shift()
data['ANNDATS2_lag'] = kk["ANNDATS2"].copy()
data['new_value_lag'] = kk['new_value'].copy()
data['Rec_Chg'] = np.int64(data['new_value'] > data['new_value_lag'])
data['Rec_Chg'] = data['Rec_Chg'] - np.int64(data['new_value'] < data['new_value_lag'])
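The two-indicator encoding of Rec_Chg is equivalent to the sign of the forecast revision wherever both values exist; the difference is that missing lags are coded 0 rather than NaN, as this toy comparison shows.

```python
import numpy as np
import pandas as pd

new = pd.Series([1.0, 1.2, 1.1, np.nan])   # current forecasts
lag = pd.Series([np.nan, 1.0, 1.2, 1.0])   # previous forecasts

# Two-indicator encoding used above: +1 upgrade, -1 downgrade, 0 otherwise.
rec_chg = (new > lag).astype(np.int64) - (new < lag).astype(np.int64)

# np.sign(new - lag) agrees where both values exist, but propagates NaN
# instead of coding missing comparisons as 0.
alt = np.sign(new - lag)
```

Coding missing comparisons as 0 treats first forecasts (no prior to revise) as "no change", which is a modeling choice worth keeping in mind when interpreting Rec_Chg.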
data = data.rename(columns = {'Pessimism_d180':'Pessimism',
"bold_pos":"Bold_pos",
'bold_neg':"Bold_neg",
'bold_d':'Bold_d'})
data2 = data[['Year_Month', 'Pessimism', 'COVERAGESIZE', 'Herding', 'Updating Frequency',
'Bold_pos', 'Bold_neg', 'Bold_d', 'Rounding', 'Reissue',
'Forecast_Age', 'Rec_Chg', 'gender']].copy()
In this section, I present some preliminary results for the first research proposal. I plot male and female forecast measures from the pre-COVID to the post-COVID period to check whether females behave differently from males.
The following code standardizes the time variable.
import seaborn as se
import matplotlib.pyplot as plt
data2['year'] = np.int32(data2['Year_Month']/100)
data2['year'] = data2['year'].values.astype('str')
data2['month'] = data2['Year_Month']%100
data2['month'] = data2['month'].values.astype('str')
data2['YM'] = data2['year'] + '-' + data2['month']
data2['YM'] = pd.to_datetime(data2['YM'])
data2['gender'] = data2['gender'].map({0:'Male', 1:'Female'})
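As a side note, the year/month string concatenation above can be collapsed into a single parse of the YYYYMM integer, assuming Year_Month is stored as an integer like 202003:

```python
import pandas as pd

# YYYYMM integers, as in the Year_Month column.
ym = pd.Series([201903, 202003, 202111])

# Parse directly to month-start timestamps in one step.
ym_dt = pd.to_datetime(ym.astype(str), format="%Y%m")
```

Both routes produce the same month-start timestamps; the explicit format string is simply more robust than relying on "YYYY-M" string parsing.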
In the pessimism graph below, the red line marks 2020-03-01, the widely recognized start of the COVID-19 outbreak. Before the outbreak, male and female pessimism levels are not very different; after the outbreak, females appear more pessimistic than males, as the graph indicates.
P = data2[['YM','Pessimism','gender']].groupby(['YM', 'gender']).mean().reset_index()
se.lineplot(data = P, x = 'YM', y = "Pessimism", hue = 'gender')
plt.axvline('2020-03-01', 0,1, color = 'red')
Herding measures whether an analyst moves their earnings-per-share forecast toward the consensus issued by other analysts. As the Herding figure indicates, females tended to herd more than males before the COVID-19 outbreak, and this gap did not change after the outbreak. Whether female analysts herded more after COVID-19 requires more rigorous statistical tests.
H = data2[['YM','Herding','gender']].groupby(['YM', 'gender']).mean().reset_index()
se.lineplot(data = H, x = 'YM', y = "Herding", hue = 'gender')
plt.axvline('2020-03-01', 0, 1, color = 'red')
Updating Frequency measures how frequently an analyst issues new forecasts. As the figure indicates, males update more than females after COVID-19, but since males already updated more before COVID-19, this variable also needs rigorous statistical tests.
H = data2[['YM','Updating Frequency','gender']].groupby(['YM', 'gender']).mean().reset_index()
se.lineplot(data = H, x = 'YM', y = "Updating Frequency", hue = 'gender')
plt.axvline('2020-03-01', 0,1, color = 'red')
Rounding measures whether an analyst tends to issue forecasts ending in a digit of 0 or 5. As the following graph indicates, it is hard to detect more rounding behavior among females than males either before or after COVID-19.
H = data2[['YM','Rounding','gender']].groupby(['YM', 'gender']).mean().reset_index()
se.lineplot(data = H, x = 'YM', y = "Rounding", hue = 'gender')
plt.axvline('2020-03-01', 0,1, color = 'red')
Bold_d measures the number of bold forecasts issued by an analyst. Bold forecasts can provide investors with more information about the incoming earnings announcement, and boldness is correlated with analyst career length and employment risk. In the graph below, female analysts do not differ significantly from male analysts in issuing bold forecasts.
H = data2[['YM','Bold_d','gender']].groupby(['YM', 'gender']).mean().reset_index()
se.lineplot(data = H, x = 'YM', y = "Bold_d", hue = 'gender')
plt.axvline('2020-03-01', 0,1, color = 'red')
Reissue measures whether an analyst is less confident in their previously issued forecasts. The graph below does not provide strong evidence that females issue revised forecasts more often than males.
H = data2[['YM','Reissue','gender']].groupby(['YM', 'gender']).mean().reset_index()
se.lineplot(data = H, x = 'YM', y = "Reissue", hue = 'gender')
plt.axvline('2020-03-01', 0,1, color = 'red')
$Y_{ijt} = D_t + T_i + \beta D_t \times T_i + X_{it} + A_i + Time_t + Firm_j + e_{ijt}$
$Y_{ijt}$ is the dependent variable we want to test. $D_t$ is a dummy equal to 1 after 2020-03-01 (post COVID-19). $T_i$ is a dummy equal to 1 if the analyst is female. $X_{it}$ collects the analyst characteristics and firm fundamental controls. $A_i$ is the analyst fixed effect, which alleviates analyst-level omitted-variable bias. $Time_t$ is the time fixed effect, which controls for macroeconomic conditions. $Firm_j$ is the firm fixed effect, which controls for persistent firm fundamental characteristics.
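To make the difference-in-differences logic concrete, the sketch below recovers the interaction coefficient $\beta$ on simulated data with plain OLS, omitting the fixed effects and controls (which the full specification absorbs via dummies or a within transformation). All numbers are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 4000

# Simulated panel: Post (D_t) x Female (T_i), with true interaction beta = 0.05.
df = pd.DataFrame({
    "Post":   rng.integers(0, 2, n),
    "Female": rng.integers(0, 2, n),
})
beta_true = 0.05
df["y"] = (0.10 * df["Post"] + 0.02 * df["Female"]
           + beta_true * df["Post"] * df["Female"]
           + rng.normal(0.0, 0.01, n))

# OLS on [1, Post, Female, Post*Female]; the last coefficient estimates beta.
X = np.column_stack([np.ones(n), df["Post"], df["Female"],
                     df["Post"] * df["Female"]])
coef, *_ = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)
beta_hat = coef[3]
```

With enough observations the interaction estimate lands close to the true 0.05; in the real specification, standard errors would additionally need clustering at the analyst or firm level.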
For the robustness test of whether the results are driven by New York analysts, I perform the following analysis: I add a new control term $\beta_2 D_t \times T_i \times NewYork_c$. If $\beta_2$ is significant while $\beta$ is insignificant, the results are driven by New York. If $\beta$ remains significant even after controlling for $\beta_2 D_t \times T_i \times NewYork_c$, the results are not driven solely by New York.
$Y_{ijtc} = D_t + T_i + \beta D_t \times T_i + \beta_2 D_t \times T_i \times NewYork_c + X_{it} + A_i + Time_t + Firm_j + e_{ijtc}$